Welcome! In this notebook we will explore the use of autoencoder neural networks to learn low-dimensional representations of financial transactions and identify a representative sample for financial statement audits.
This notebook is largely inspired by Learning Sampling in Financial Statement Audits using Vector Quantised Autoencoder Neural Networks, a 2020 paper by Marco Schreyer, Timur Sattarov, Anita Gierbl, Bernd Reimer, and Damian Borth, as well as an excellent demonstration of autoencoders by Alexander Van de Kleut, entitled Variational Autoencoders (VAE) with PyTorch.
Auditing standards require the assessment of the underlying transactions that comprise the financial statements to detect errors or fraud that would result in material misstatement. The accounting profession has developed a framework for addressing this requirement, known as the Audit Risk Model.
The Audit Risk Model defines audit risk as the combination of inherent risk, control risk and detection risk:
$\text{Audit Risk} = \text{Inherent Risk} \times \text{Control Risk} \times \text{Detection Risk}$
Detection risk is composed of sampling risk and non-sampling risk:
$\text{Detection Risk} = \text{Sampling Risk} + \text{Non-Sampling Risk}$
We focus our attention on sampling risk, which is defined as the risk that the auditor's conclusion based on the sample would be different had the entire population been tested. In other words, it is the risk that the sample is not representative of the population and does not provide sufficient appropriate audit evidence to detect material misstatements.
There are a variety of sampling methods used by auditors. Random sampling is based on each member of the population having an equal chance of being selected. Stratified sampling subdivides the population into homogenous groups from which to make selections. Monetary unit sampling treats each dollar amount from the population as the sampling unit and selects items when a cumulative total meets or exceeds a predefined sampling interval when cycling through the population.
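The monetary unit sampling procedure described above can be sketched in a few lines of Python. The payment amounts and sampling interval below are hypothetical, purely for illustration:

```python
# Sketch of monetary unit sampling (MUS): treat each dollar as a sampling
# unit and select an item whenever the cumulative total crosses the next
# multiple of the sampling interval.

def monetary_unit_sample(amounts, interval):
    """Return indices of items selected by fixed-interval MUS."""
    selected = []
    cumulative = 0.0
    next_threshold = interval
    for i, amount in enumerate(amounts):
        cumulative += amount
        # a large item can cross several thresholds, but is selected once
        while cumulative >= next_threshold:
            if not selected or selected[-1] != i:
                selected.append(i)
            next_threshold += interval
    return selected

payments = [120.0, 4500.0, 80.0, 2500.0, 60.0]  # hypothetical amounts
print(monetary_unit_sample(payments, interval=2000.0))  # [1, 3]
```

Note how the $4,500 payment spans more than one interval: larger items have a proportionally higher chance of selection, which is the defining property of MUS.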
Autoencoders offer an alternative method for addressing sampling risk. An autoencoder is a neural network that learns to encode data into lower dimensions and decode it back into higher dimensions. The resulting model provides a low-dimensional representation of the data, disentangling it in a way that reveals something about the fundamental structure. Auditors can model transactions in this way and select from low-dimensional clusters. They can also identify anomalous transactions based on how much they deviate from other transactions in this latent space.
To demonstrate the use of autoencoders for financial transaction analysis, we will use the City of Philadelphia payments data. It is one of two datasets used in Schreyer et al (2020) and consists of nearly a quarter-million payments from about 60 city offices, departments, boards and commissions. It covers the City's fiscal year 2017 (July 2016 through June 2017) and represents nearly $4.2 billion in payments during that period.
Let's begin by reading in the modules we'll use to analyze these data:
# import preprocessing modules
import time
import datetime
import calendar
import numpy as np
import pandas as pd
import datapane as dp
import plotly.express as px
import plotly.graph_objects as go
from collections import defaultdict
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
# import pytorch modules
import torch
torch.manual_seed(0)
import torch.nn as nn
import torch.distributions
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
# use GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
Read in the data:
df = pd.read_csv('data/city_payments_fy2017.csv')
View the first five rows:
df.head()
| fy | fm | check_date | document_no | dept | department_title | char_ | character_title | sub_obj | sub_obj_title | vendor_name | doc_ref_no_prefix | doc_ref_no_prefix_definition | contract_number | contract_description | transaction_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 10 | 2017-04-12 | CHEK17119771 | 42 | 42 COMMERCE | 2 | 02 PURCHASE OF SERVICES | 231 | OVERTIME MEALS 0231 | EAT AT JOE'S | PVXX | payment voucher | NaN | NaN | 66.82 |
| 1 | 2017 | 12 | 2017-06-09 | ACHD17177233 | 26 | 26 LICENSES & INSPECTIONS | 2 | 02 PURCHASE OF SERVICES | 211 | TRANSPORTATION 0211 | L & I TRAVEL IMPREST FUND | PVXX | payment voucher | NaN | NaN | 127.33 |
| 2 | 2017 | 5 | 2016-11-25 | CHEK17063736 | 44 | 44 LAW | 2 | 02 PURCHASE OF SERVICES | 258 | COURT REPORTERS 0258 | MARLENE BELL REPORTING, INC. | PVXX | payment voucher | NaN | NaN | 454.20 |
| 3 | 2017 | 1 | 2016-07-07 | CHEK17000247 | 11 | 11 POLICE | 2 | 02 PURCHASE OF SERVICES | 260 | REPAIR AND MAINTENANCE CHARGES 0260 | RICOH AMERICAS CORPORATION | VCXX | procurement | NaN | NaN | 50.00 |
| 4 | 2017 | 1 | 2016-07-08 | ACHD17000233 | 23 | 23 PRISONS | 3 | 03 MATERIALS AND SUPPLIES | 313 | FOOD 0313 | PHILADELPHIA PRISONS | PCXX | petty cash | NaN | NaN | 71.92 |
Check the dimensions of the data:
df.shape
(238894, 16)
We can see that several features of this dataset are redundant and some records contain NaN ("not a number") values. Next, we will clean up the data, filling NaNs with "None". This is reasonable because the absence of a contract number or description is itself informative, and we want to retain that information for modeling. We'll also remove redundant columns and add new features for amount_sign, payment_method, weekday, day, month, and year.
%%time
# clean up the data
df.department_title = df.department_title.str.split().apply(
lambda x: ' '.join(x[1:len(x)]))
df.character_title = df.character_title.str.split().apply(
lambda x: ' '.join(x[1:len(x)]))
df.sub_obj_title = df.sub_obj_title.str.split().apply(
lambda x: ' '.join(x[0:len(x)-1]))
df['amount_sign'] = df.transaction_amount.apply(
lambda x: 1 if x >= 0 else 0)
df.transaction_amount = df.transaction_amount.abs()
df['payment_method'] = df.document_no.str.split('1').apply(
lambda x: x[0])
df['weekday'] = df.check_date.str.split('-').apply(
lambda x: calendar.day_name[
datetime.datetime(int(x[0]),
int(x[1]),
int(x[2])).weekday()])
df['day'] = df.check_date.str.split('-').apply(
lambda x: x[2])
df['month'] = df.check_date.str.split('-').apply(
lambda x: calendar.month_name[int(x[1])])
df['year'] = df.check_date.str.split('-').apply(
lambda x: x[0])
df.drop(['fy', 'fm', 'check_date', 'dept', 'char_', 'sub_obj',
'doc_ref_no_prefix','contract_number'],
axis=1, inplace=True)
df.fillna('None', inplace=True)
CPU times: total: 6.02 s Wall time: 6.19 s
Here is the result of this initial preprocessing:
df.head()
| document_no | department_title | character_title | sub_obj_title | vendor_name | doc_ref_no_prefix_definition | contract_description | transaction_amount | amount_sign | payment_method | weekday | day | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CHEK17119771 | COMMERCE | PURCHASE OF SERVICES | OVERTIME MEALS | EAT AT JOE'S | payment voucher | None | 66.82 | 1 | CHEK | Wednesday | 12 | April | 2017 |
| 1 | ACHD17177233 | LICENSES & INSPECTIONS | PURCHASE OF SERVICES | TRANSPORTATION | L & I TRAVEL IMPREST FUND | payment voucher | None | 127.33 | 1 | ACHD | Friday | 09 | June | 2017 |
| 2 | CHEK17063736 | LAW | PURCHASE OF SERVICES | COURT REPORTERS | MARLENE BELL REPORTING, INC. | payment voucher | None | 454.20 | 1 | CHEK | Friday | 25 | November | 2016 |
| 3 | CHEK17000247 | POLICE | PURCHASE OF SERVICES | REPAIR AND MAINTENANCE CHARGES | RICOH AMERICAS CORPORATION | procurement | None | 50.00 | 1 | CHEK | Thursday | 07 | July | 2016 |
| 4 | ACHD17000233 | PRISONS | MATERIALS AND SUPPLIES | FOOD | PHILADELPHIA PRISONS | petty cash | None | 71.92 | 1 | ACHD | Friday | 08 | July | 2016 |
Next, we'll explore the data to get a sense of its characteristics.
df.describe().round()
| transaction_amount | amount_sign | |
|---|---|---|
| count | 238894.0 | 238894.0 |
| mean | 17574.0 | 1.0 |
| std | 436585.0 | 0.0 |
| min | 0.0 | 0.0 |
| 25% | 66.0 | 1.0 |
| 50% | 250.0 | 1.0 |
| 75% | 1236.0 | 1.0 |
| max | 99000000.0 | 1.0 |
Let's have a look at the distribution of the transaction amounts using a histogram with 100 bins:
fig = px.histogram(df, x='transaction_amount', log_y=True, nbins=100,
title='Histogram of Transaction Amounts', width=900)
fig.update_traces(
marker_line_width=1,
marker_line_color='black',
marker=dict(color='gold'),
)
fig.update_xaxes(title='Transaction amount ($ millions)',
range=[-2000000,102000000],
gridcolor='lightgray',
showgrid=True,
gridwidth=1)
fig.update_yaxes(title='Count (log-scaled)',
gridcolor='lightgray',
showgrid=True,
gridwidth=1)
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)',
hoverlabel=dict(
bgcolor='ivory',
font_size=16,
font_family='Rockwell'),
font=dict(family='Rockwell',
color='navy',
size=16),
title_x=0.5)
fig.show()
It looks like the majority of transactions are $500,000 or less, while several outlier payments reach relatively high dollar amounts. Next, let's look at the spend by city department:
top_twenty_depts = (df.department_title.value_counts()\
/ df.department_title.value_counts().sum()\
* 100).head(20)[::-1]
top_twenty_depts = top_twenty_depts.reset_index().rename(
columns={'department_title':'share',
'index':'department'})
fig = px.bar(top_twenty_depts, x='share', y='department',
title='Top Twenty Departments By Share of Spend',
width=700, orientation='h')
fig.update_traces(marker_line_width=1,
marker_line_color='black',
marker=dict(color='lightgreen'),
hovertemplate='<br>'.join([
'Department: %{y}',
'Share: %{x:.2f}%']))
fig.update_xaxes(title='Share of spend',
range=[-0.25,23],
gridcolor='lightgray',
showgrid=True,
gridwidth=1,
tickvals=[0,5,10,15,20],
ticktext=[f'{x}%' for x in range(0,21,5)])
fig.update_yaxes(title='',
gridcolor='lightgray',
showgrid=True,
gridwidth=1)
fig.update_layout(title_font_size=20,
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)',
showlegend=False,
hoverlabel=dict(
bgcolor='ivory',
font_size=16,
font_family='Rockwell'),
font=dict(family='Rockwell',
color='navy',
size=12),
title_x=0.5)
fig.show()
Of the 58 city departments included in the dataset, about 60% of payments go to Fleet Management, Managing Director, Human Resources, Water, and Health. What about the nature of these expenses? The document reference feature sheds some light on this:
doc_ref_df = df.doc_ref_no_prefix_definition.value_counts().reset_index()
doc_ref_df.doc_ref_no_prefix_definition = doc_ref_df.doc_ref_no_prefix_definition\
/ doc_ref_df.doc_ref_no_prefix_definition.sum() * 100
fig = px.pie(doc_ref_df, names='index', values='doc_ref_no_prefix_definition',
labels='index', title='Spend By Document Reference'+\
'<br><sup>hover over to see labels</sup>',
hole=0.5, color_discrete_sequence=px.colors.qualitative.Set3,
width=500, height=500).update_traces(marker_line_color='rgb(0,0,0)',
textinfo='none',
hovertemplate='<br>'.join([
'doc ref: %{label}',
'share of spend: '+\
'%{value:.2f}%']),
textfont_size=14,
marker_line_width=1)
fig.update_layout(title_font_size=20,
showlegend=False,
hoverlabel=dict(
font_size=16,
font_family='Rockwell'),
font=dict(family='Rockwell',
color='navy',
size=12),
title_x=0.5)
fig.show()
Hovering over the doughnut plot above reveals the breakdown of expenses by document reference.
What days do payments tend to be made on?
day_counts = df.day.value_counts().sort_index().reset_index()
day_counts.rename(columns={'index':'day','day':'count'}, inplace=True)
fig = px.line(day_counts, x='day', y='count',
title='Frequency of Transactions by Day of the Month')
fig.update_traces(mode='markers+lines')
fig.update_xaxes(gridcolor='lightgray',
showgrid=True,
gridwidth=1)
fig.update_yaxes(gridcolor='lightgray',
showgrid=True,
gridwidth=1)
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)',
title_font_size=20,
showlegend=False,
hoverlabel=dict(
bgcolor='ivory',
font_size=16,
font_family='Rockwell'),
font=dict(family='Rockwell',
color='navy',
size=14),
title_x=0.5)
fig.show()
While payments are made daily, there appear to be four times in the month when they are more frequent. This might coincide with the payroll cycle or reflect vendor payment terms (e.g., 2/10 net 30).
Now that we have a sense of our dataset, let's prepare it for modeling. Note that the ETL steps in this section are collected in the preprocessing.py module provided in this repository.
We will need our data to be embedded in a vector space so we can apply numerical analysis to it. Let's begin by getting the features that are categorical and numerical:
# determine categorical and numerical features
numerical = df.select_dtypes(include=['int64', 'float64']).columns
categorical = df.select_dtypes(include=['object', 'bool']).columns
print(numerical)
print(categorical)
Index(['transaction_amount', 'amount_sign'], dtype='object')
Index(['document_no', 'department_title', 'character_title', 'sub_obj_title',
'vendor_name', 'doc_ref_no_prefix_definition', 'contract_description',
'payment_method', 'weekday', 'day', 'month', 'year'],
dtype='object')
Create a dictionary to store the feature names to help with plotting later:
# get feature names for plotting
feature_names = dict(zip(range(len(categorical) +\
len(numerical)-1),
categorical.tolist()[1:] +\
numerical.tolist()))
feature_names
{0: 'department_title',
1: 'character_title',
2: 'sub_obj_title',
3: 'vendor_name',
4: 'doc_ref_no_prefix_definition',
5: 'contract_description',
6: 'payment_method',
7: 'weekday',
8: 'day',
9: 'month',
10: 'year',
11: 'transaction_amount',
12: 'amount_sign'}
Next, we instantiate a LabelEncoder to encode the categorical features, replacing strings with nominal numerical values. This will allow us to set the color of plot markers to the different values of categorical features:
le = defaultdict(LabelEncoder)
# Label encoding
Y = df[categorical].apply(lambda x: le[x.name].fit_transform(x))
Y = pd.concat([Y, df[numerical]], axis=1)
Y.head()
| document_no | department_title | character_title | sub_obj_title | vendor_name | doc_ref_no_prefix_definition | contract_description | payment_method | weekday | day | month | year | transaction_amount | amount_sign | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 212404 | 13 | 4 | 104 | 1560 | 7 | 1120 | 1 | 4 | 11 | 0 | 1 | 66.82 | 1 |
| 1 | 106470 | 24 | 4 | 201 | 2894 | 7 | 1120 | 0 | 0 | 8 | 6 | 1 | 127.33 | 1 |
| 2 | 168343 | 23 | 4 | 32 | 3211 | 7 | 1120 | 1 | 0 | 24 | 9 | 0 | 454.20 | 1 |
| 3 | 119090 | 44 | 4 | 180 | 4274 | 9 | 1120 | 1 | 2 | 6 | 5 | 0 | 50.00 | 1 |
| 4 | 209 | 45 | 3 | 50 | 3959 | 8 | 1120 | 0 | 0 | 7 | 5 | 0 | 71.92 | 1 |
Now, for purposes of this demonstration, we will use a subset of the data for faster training:
# take a sample that will become our X matrix
X_sample = df.sample(n=128*373, random_state=1729).drop(
columns='document_no')
# take another sample using the same random_state and
# thus the same indices as X_sample, to use to recover labels
Y_sample = Y.sample(n=128*373, random_state=1729)
X_sample.shape , Y_sample.shape
((47744, 13), (47744, 14))
While we are at it, we'll also define a function to conveniently recover the labels from the label encoded Y matrix:
def recover_labels(Y):
'''
Recover data labels from tensor to pandas DataFrame
by applying the inverse transform of the label encoder
on the label encoded payments data
Parameters:
Y: torch.Tensor of the label encoded payments data
Returns:
recovered_Y: pandas DataFrame of payments data
with interpretable labels instead of encoded numbers
'''
recovered_Y = pd.concat([
pd.DataFrame(Y, columns=Y_sample.columns)[categorical]\
.astype('int32').apply(lambda x: le[x.name].inverse_transform(x)),
pd.DataFrame(Y, columns=Y_sample.columns)[numerical]
], axis=1)
return recovered_Y
Let's try this out. Here is the label encoded Y_sample ...
Y_sample.head()
| document_no | department_title | character_title | sub_obj_title | vendor_name | doc_ref_no_prefix_definition | contract_description | payment_method | weekday | day | month | year | transaction_amount | amount_sign | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 214627 | 197810 | 25 | 4 | 84 | 5475 | 3 | 1120 | 1 | 2 | 1 | 7 | 1 | 250.00 | 1 |
| 46641 | 223236 | 25 | 4 | 84 | 3606 | 3 | 1120 | 1 | 2 | 17 | 8 | 1 | 150.00 | 1 |
| 119334 | 148015 | 39 | 4 | 179 | 4007 | 9 | 1461 | 1 | 2 | 5 | 10 | 0 | 853.00 | 1 |
| 17132 | 7159 | 21 | 3 | 100 | 4688 | 9 | 1560 | 0 | 0 | 28 | 5 | 0 | 14.63 | 1 |
| 131162 | 155558 | 19 | 3 | 96 | 2205 | 2 | 1120 | 1 | 2 | 19 | 10 | 0 | 25.62 | 1 |
... and here is the same Y_sample with the labels recovered:
recover_labels(Y_sample).head()
| document_no | department_title | character_title | sub_obj_title | vendor_name | doc_ref_no_prefix_definition | contract_description | payment_method | weekday | day | month | year | transaction_amount | amount_sign | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 214627 | CHEK17101085 | MANAGING DIRECTOR | PURCHASE OF SERVICES | LEGAL SERVICES | YALONDA E HOUSTON | court appointed attorneys | None | CHEK | Thursday | 02 | March | 2017 | 250.00 | 1 |
| 46641 | CHEK17133085 | MANAGING DIRECTOR | PURCHASE OF SERVICES | LEGAL SERVICES | NGHI D VO | court appointed attorneys | None | CHEK | Thursday | 18 | May | 2017 | 150.00 | 1 |
| 119334 | CHEK17037586 | OFFICE OF HOUSING | PURCHASE OF SERVICES | RENTS | PITNEY BOWES INCORPORATED | procurement | Rental of Pitney Bowes Mailing Equipment | CHEK | Thursday | 06 | October | 2016 | 853.00 | 1 |
| 17132 | ACHD17013816 | HEALTH | MATERIALS AND SUPPLIES | OFFICE MATERIALS AND SUPPLIES | STAPLES BUSINESS ADVANTAGE | procurement | STAPLES CONTRACT & COMMERCIAL INC Bid # 010615NJ | ACHD | Friday | 29 | July | 2016 | 14.63 | 1 |
| 131162 | CHEK17047593 | FLEET MANAGEMENT | MATERIALS AND SUPPLIES | MOTOR VEHICLE PARTS AND ACCESSORIES | IEH AUTO PARTS LLC | auto parts | None | CHEK | Thursday | 20 | October | 2016 | 25.62 | 1 |
Next, let's one-hot-encode the categorical features (everything except the numerical columns, transaction_amount and amount_sign):
%%time
# create dummy variables to one-hot-encode
X_sample = pd.get_dummies(X_sample, drop_first=True)
CPU times: total: 719 ms Wall time: 892 ms
Partition the data into train and test sets:
%%time
# create training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
X_sample, Y_sample, test_size=128*73, random_state=1729)
CPU times: total: 1.06 s Wall time: 1.12 s
Min-max scale transaction amount:
# min-max scale the transaction amount
X_train.transaction_amount = MinMaxScaler().fit_transform(
X_train.transaction_amount.to_numpy().reshape(-1,1))
X_test.transaction_amount = MinMaxScaler().fit_transform(
X_test.transaction_amount.to_numpy().reshape(-1,1))
Convert pandas DataFrames into Pytorch FloatTensors:
# create tensors
X_train = torch.FloatTensor(X_train.values)
X_test = torch.FloatTensor(X_test.values)
Y_train = torch.FloatTensor(Y_train.values)
Y_test = torch.FloatTensor(Y_test.values)
Combine features and labels for each of the train and test sets into their respective PyTorch DataLoader objects:
# combine X_ and Y_ into a single dataset
train_set = TensorDataset(X_train, Y_train)
train_set = DataLoader(train_set,
batch_size=128,
shuffle=True)
test_set = TensorDataset(X_test, Y_test)
test_set = DataLoader(test_set,
batch_size=128,
shuffle=True)
Finally, we can save these train and test DataLoaders to our working directory for fast retrieval later on. The torch.save() function serializes objects with pickle, the same byte-serialization approach used by the joblib and pickle modules:
data = [train_set, test_set]
torch.save(data, 'data/philly_payments_clean')
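As a quick sketch of that serialization, here is a save/load round trip through an in-memory buffer, using a dict of toy tensors rather than our actual DataLoaders:

```python
import io

import torch

# torch.save serializes arbitrary Python objects via pickle; here we
# round-trip a dict of toy tensors through an in-memory buffer instead
# of a file on disk.
payload = {'X_train': torch.rand(6, 4), 'X_test': torch.rand(2, 4)}

buffer = io.BytesIO()
torch.save(payload, buffer)
buffer.seek(0)
restored = torch.load(buffer)

print(torch.equal(payload['X_train'], restored['X_train']))  # True
```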
Now that we have preprocessed the data, let's model it. We can define some autoencoder classes with the help of PyTorch, a popular open-source library for deep learning. A brief note about PyTorch: it builds a directed acyclic graph (DAG) of the network's operations at each iteration, and its autograd engine computes gradients of the operations recorded in the graph for efficient parameter updates during training.
Returning to our objective, if we can project financial transaction vectors into a subspace that defines most of the variance, we'll have organized the data in such a way that transactions are separated into concisely summarized groups representing the whole population. This representation might provide a better perspective of the underlying patterns in the data or amplify anomalous items to facilitate detection. What, then, is the best low-dimensional representation of our population and how do we find it?
In general, we can use unsupervised learning to identify the latent (underlying) structure of our data. Popular unsupervised learning techniques include principal components analysis (PCA), singular value decomposition (SVD), and t-SNE. Another technique for learning low-dimensional representations is through training an autoencoder.
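As a point of comparison, here is how a linear technique like PCA produces a two-dimensional representation. The data below is random noise standing in for one-hot encoded payments, so only the shapes are meaningful:

```python
# Linear baseline: project 20-D toy data into a 2-D latent space with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))            # stand-in for encoded payments
Z = PCA(n_components=2).fit_transform(X)  # 2-D latent representation
print(Z.shape)  # (500, 2)
```

An autoencoder with nonlinear activations can capture structure that this linear projection cannot, which is the motivation for what follows.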
Traditional Autoencoder Architecture

Autoencoders are considered an unsupervised technique in that learning happens without respect to known target labels. Figure 1 shows the typical structure of an autoencoder, consisting of two networks: an encoder and a decoder. These two networks meet in the middle at a matrix $Z$, the latent representation we are interested in.
To train an autoencoder, we begin by embedding transactional data in a vector space, represented as the matrix $X$. Batch samples of $X$ are passed through the encoder, which maps $X$ to the matrix $Z$. $Z$ is then passed through the decoder to reconstruct the original input as $X^*$. The difference between $X$ and its reconstruction $X^*$ is the reconstruction loss, defined as follows:

where:
$d(e(x)) =$ the reconstructed output of the composite encoder and decoder network
$||*||^{2}_{2} = $ the L2-norm squared or the sum of squared errors between $X$ and $X^*$
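The reconstruction loss above can be sketched with a toy encoder/decoder pair in PyTorch; the layer sizes here are illustrative, smaller than those in models.py:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Linear(16, 2)   # e(x): maps a 16-D input to a 2-D latent code
decoder = nn.Linear(2, 16)   # d(z): maps the code back to 16 dimensions

x = torch.rand(8, 16)             # a toy batch of 8 "transactions"
x_star = decoder(encoder(x))      # reconstruction x* = d(e(x))
loss = ((x - x_star) ** 2).sum()  # squared L2 norm of the residual
print(loss.item() > 0)  # True
```

Minimizing this loss by gradient descent forces the 2-D code to retain as much information about the input as possible.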
Alternatively, we could use a variational autoencoder to model the data. This kind of autoencoder differs from a traditional autoencoder in that the encoder generally has two output layers, representing the mean and standard deviation of a distribution. The latent matrix $Z$ is then determined by sampling from a Gaussian distribution parameterized by the means and standard deviations produced by those output layers, as seen below:
Variational Autoencoder Architecture

With variational autoencoders, it is common to add a term to the loss function that penalizes latent distributions whose means drift away from zero or whose standard deviations stray from one. This term is the Kullback–Leibler (KL) divergence, a measure of how much one probability distribution differs from another; intuitively, it captures how little two distributions overlap. The KL divergence between $p(z \mid x)$ and $\mathcal{N}(0, 1)$ can be defined as

Source: Alexander Van de Kleut, Variational Autoencoders (VAE) with PyTorch
We combine the $\mathbb{KL}$ loss with the sum of squared errors from the traditional autoencoder to define the loss function for the variational autoencoder:

where
$\gamma$ (gamma) is a parameter we can tune to adjust how strongly the $\mathbb{KL}$ loss is weighted during optimization
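Putting the pieces together, here is a minimal sketch of the combined loss, using the closed-form KL divergence for a diagonal Gaussian against $\mathcal{N}(0, 1)$. The tensors below are toy stand-ins, not outputs of the VAE in models.py:

```python
import torch

torch.manual_seed(0)
x      = torch.rand(8, 16)   # toy batch of inputs
x_star = torch.rand(8, 16)   # stand-in for the decoder's reconstructions
mu     = torch.zeros(8, 2)   # encoder means (zeros here for illustration)
sigma  = torch.ones(8, 2)    # encoder standard deviations (ones here)
gamma  = 0.001               # weight on the KL term

recon = ((x - x_star) ** 2).sum()
# closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian
kl = (0.5 * (sigma ** 2 + mu ** 2 - 1) - torch.log(sigma)).sum()
loss = recon + gamma * kl
print(kl.item())  # 0.0, since mu = 0 and sigma = 1 match N(0, 1) exactly
```

With these particular values the KL term vanishes; during training, the reconstruction term pulls the latent distributions apart while the KL term pulls them back toward the standard normal, and $\gamma$ sets the balance.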
With that, let's implement this and explore the results. I prepared a module called models.py that contains the AE and VAE classes we'll be using. Both are simple networks, with two 512-dimensional hidden intermediary layers and one 2-dimensional latent layer. The VAE has two additional 2-dimensional layers to sample from. I'll run the preprocessing.py module here to demonstrate that as well:
from scripts.models import AE, VAE
from scripts.preprocessing import Preprocessor
df = pd.read_csv('data/city_payments_fy2017.csv')
pre = Preprocessor()
data = pre.process(df)
Cleaning data... Label encoding the data... Sampling from the original data... One-hot-encoding the data... Generating tensors... Combining tensors into one dataset... Preprocessing complete! Total Time: 0.25 minutes
Now we train a traditional autoencoder:
latent_dims = 2
input_dims = data.X_train.shape[1]
output_dims = data.X_train.shape[1]
model1 = AE(input_dims, latent_dims, output_dims, batch_norm=False)
model1.fit(data, epochs=20)
epoch: [1/20] | loss: 710.4146 | elapsed time: 0.55 minutes epoch: [2/20] | loss: 634.1962 | elapsed time: 0.53 minutes epoch: [3/20] | loss: 620.7513 | elapsed time: 0.53 minutes epoch: [4/20] | loss: 601.9883 | elapsed time: 0.53 minutes epoch: [5/20] | loss: 593.5554 | elapsed time: 0.53 minutes epoch: [6/20] | loss: 565.4938 | elapsed time: 0.54 minutes epoch: [7/20] | loss: 555.7859 | elapsed time: 0.54 minutes epoch: [8/20] | loss: 577.5383 | elapsed time: 0.55 minutes epoch: [9/20] | loss: 561.3287 | elapsed time: 0.61 minutes epoch: [10/20] | loss: 574.3048 | elapsed time: 0.84 minutes epoch: [11/20] | loss: 529.9324 | elapsed time: 1.0 minutes epoch: [12/20] | loss: 516.5233 | elapsed time: 0.85 minutes epoch: [13/20] | loss: 528.6144 | elapsed time: 0.74 minutes epoch: [14/20] | loss: 575.4102 | elapsed time: 0.83 minutes epoch: [15/20] | loss: 531.4222 | elapsed time: 1.0 minutes epoch: [16/20] | loss: 527.0771 | elapsed time: 1.08 minutes epoch: [17/20] | loss: 534.6619 | elapsed time: 1.1 minutes epoch: [18/20] | loss: 516.548 | elapsed time: 0.95 minutes epoch: [19/20] | loss: 533.3744 | elapsed time: 0.82 minutes epoch: [20/20] | loss: 511.7211 | elapsed time: 0.72 minutes
model1
AE( (bn1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (bn2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (linear1): Linear(in_features=4845, out_features=512, bias=True) (linear2): Linear(in_features=512, out_features=2, bias=True) (linear3): Linear(in_features=2, out_features=512, bias=True) (linear4): Linear(in_features=512, out_features=4845, bias=True) )
In the AE.fit() method, we train the autoencoder, but we also include a loop to validate the model on test data unseen during training. Here is a plot of both train and test loss:
model1.plot_loss()
We observe that the loss drops dramatically after only 20 batches and then declines very gradually. Let's see what the embedding looks like:
model1.plot_projection(data, which='test_set')
Next, let's train the same autoencoder with batch normalization enabled to see how it affects training:
model2 = AE(input_dims, latent_dims, output_dims, batch_norm=True)
model2.fit(data, epochs=20)
epoch: [1/20] | loss: 708.7727 | elapsed time: 0.55 minutes epoch: [2/20] | loss: 686.2775 | elapsed time: 0.54 minutes epoch: [3/20] | loss: 656.3316 | elapsed time: 0.54 minutes epoch: [4/20] | loss: 654.1249 | elapsed time: 0.54 minutes epoch: [5/20] | loss: 677.3488 | elapsed time: 0.54 minutes epoch: [6/20] | loss: 645.1639 | elapsed time: 0.54 minutes epoch: [7/20] | loss: 649.9794 | elapsed time: 0.54 minutes epoch: [8/20] | loss: 594.2822 | elapsed time: 0.54 minutes epoch: [9/20] | loss: 619.0198 | elapsed time: 0.54 minutes epoch: [10/20] | loss: 637.7162 | elapsed time: 0.54 minutes epoch: [11/20] | loss: 633.1409 | elapsed time: 0.54 minutes epoch: [12/20] | loss: 645.791 | elapsed time: 0.54 minutes epoch: [13/20] | loss: 606.5604 | elapsed time: 0.54 minutes epoch: [14/20] | loss: 592.6423 | elapsed time: 0.54 minutes epoch: [15/20] | loss: 612.3545 | elapsed time: 0.54 minutes epoch: [16/20] | loss: 592.6324 | elapsed time: 0.54 minutes epoch: [17/20] | loss: 585.6572 | elapsed time: 0.54 minutes epoch: [18/20] | loss: 598.5024 | elapsed time: 0.53 minutes epoch: [19/20] | loss: 589.979 | elapsed time: 0.55 minutes epoch: [20/20] | loss: 557.442 | elapsed time: 0.54 minutes
model2.plot_loss()
model2.plot_projection(data, which='test_set')
Now let's try the variational autoencoder, first with the full KL weight of gamma = 1:
model3 = VAE(input_dims, latent_dims, output_dims, batch_norm=True)
model3.fit(data, epochs=20, gamma=1)
epoch: [1/20] | loss: 1726.2412 | elapsed time: 0.54 minutes epoch: [2/20] | loss: 1203.718 | elapsed time: 0.56 minutes epoch: [3/20] | loss: 1166.6509 | elapsed time: 0.55 minutes epoch: [4/20] | loss: 1075.7648 | elapsed time: 0.58 minutes epoch: [5/20] | loss: 1021.8521 | elapsed time: 0.55 minutes epoch: [6/20] | loss: 1015.1688 | elapsed time: 0.59 minutes epoch: [7/20] | loss: 1038.255 | elapsed time: 0.55 minutes epoch: [8/20] | loss: 1014.0497 | elapsed time: 0.55 minutes epoch: [9/20] | loss: 1032.7101 | elapsed time: 0.54 minutes epoch: [10/20] | loss: 1018.7817 | elapsed time: 0.56 minutes epoch: [11/20] | loss: 1019.3304 | elapsed time: 0.57 minutes epoch: [12/20] | loss: 1017.0781 | elapsed time: 0.55 minutes epoch: [13/20] | loss: 1009.1345 | elapsed time: 0.54 minutes epoch: [14/20] | loss: 1009.4597 | elapsed time: 0.55 minutes epoch: [15/20] | loss: 1011.3191 | elapsed time: 0.57 minutes epoch: [16/20] | loss: 1022.9342 | elapsed time: 0.55 minutes epoch: [17/20] | loss: 1010.7504 | elapsed time: 0.54 minutes epoch: [18/20] | loss: 1001.7854 | elapsed time: 0.54 minutes epoch: [19/20] | loss: 1041.136 | elapsed time: 0.55 minutes epoch: [20/20] | loss: 980.733 | elapsed time: 0.54 minutes
model3.plot_loss()
model3.plot_projection(data, which='test_set')
And again with a much smaller KL weight, gamma = 0.001:
model4 = VAE(input_dims, latent_dims, output_dims, batch_norm=True)
model4.fit(data, epochs=20, gamma=0.001)
epoch: [1/20] | loss: 762.7522 | elapsed time: 0.56 minutes epoch: [2/20] | loss: 695.9151 | elapsed time: 0.54 minutes epoch: [3/20] | loss: 684.8369 | elapsed time: 0.54 minutes epoch: [4/20] | loss: 645.6278 | elapsed time: 0.54 minutes epoch: [5/20] | loss: 640.1102 | elapsed time: 0.55 minutes epoch: [6/20] | loss: 589.3079 | elapsed time: 0.55 minutes epoch: [7/20] | loss: 627.2583 | elapsed time: 0.55 minutes epoch: [8/20] | loss: 632.123 | elapsed time: 0.55 minutes epoch: [9/20] | loss: 603.3551 | elapsed time: 0.55 minutes epoch: [10/20] | loss: 619.6804 | elapsed time: 0.56 minutes epoch: [11/20] | loss: 639.6083 | elapsed time: 0.55 minutes epoch: [12/20] | loss: 617.1934 | elapsed time: 0.55 minutes epoch: [13/20] | loss: 556.191 | elapsed time: 0.55 minutes epoch: [14/20] | loss: 625.5158 | elapsed time: 0.55 minutes epoch: [15/20] | loss: 616.4233 | elapsed time: 0.55 minutes epoch: [16/20] | loss: 572.6877 | elapsed time: 0.55 minutes epoch: [17/20] | loss: 578.4244 | elapsed time: 0.55 minutes epoch: [18/20] | loss: 605.4987 | elapsed time: 0.55 minutes epoch: [19/20] | loss: 577.6849 | elapsed time: 0.55 minutes epoch: [20/20] | loss: 562.2061 | elapsed time: 0.55 minutes
model4.plot_loss()
model4.plot_projection(data, which='test_set')
Finally, we can recover the labels and sort the transactions by reconstruction loss to identify potential outliers:
recovered_data = pd.concat([
data.recover_labels(model4.labels[:,:-1]),
pd.DataFrame(model4.labels[:,-1].numpy(),
columns=['reconstruction_loss'])
], axis=1)
top_ten = recovered_data.sort_values(by='reconstruction_loss', ascending=False).head(10)
top_ten.transaction_amount = top_ten.transaction_amount.apply(lambda x: '%.2f' % x)
top_ten
| document_no | department_title | character_title | sub_obj_title | vendor_name | doc_ref_no_prefix_definition | contract_description | payment_method | weekday | day | month | year | transaction_amount | amount_sign | reconstruction_loss | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 661 | CHEK17009282 | SINKING FUND COMMISSION | DEBT SERVICE | PRINCIPAL PAYMENTS ON CITY DEBTL/T | WATER & SEWER REVENUE BOND SINKING FUND | payment voucher | None | CHEK | Thursday | 28 | July | 2016 | 62040000.00 | 1.0 | 13.881687 |
| 1118 | ACHD17170973 | COMMERCE | REAL PROPERTY | PROF SRVAIRFIELD RUNWAYS/IMPROVEM | WESTON SOLUTIONS, INC. | professional services | RENEWAL TWO | ACHD | Friday | 19 | May | 2017 | 12361.93 | 1.0 | 13.420626 |
| 1089 | CHEK17004136 | FIRE | MATERIALS AND SUPPLIES | DRY GOODS/NOTIONS/WEARING APPAREL | ADAM K THIEL | reimbursable expense vouchers for City employees | None | CHEK | Thursday | 14 | July | 2016 | 838.34 | 1.0 | 13.157266 |
| 1007 | CHEK17094091 | MDO-OFFICE OF TECHNOLOGY | REAL PROPERTY | COMPUTER EQUIPMENT | NU VISION TECHNOLOGIES LLC | procurement | Purchase and Installation of Telephone Hardwar... | CHEK | Thursday | 09 | February | 2017 | 176886.88 | 1.0 | 12.605608 |
| 698 | ACHD17185928 | WATER | PURCHASE OF SERVICES | ARCHITECTURAL & ENGINEERING SRVCS | HAZEN AND SAWYER | professional services | Capital Improvements | ACHD | Friday | 16 | June | 2017 | 49153.67 | 1.0 | 12.494843 |
| 437 | CHEK17028129 | STREETS | EQUIPMENT | VEHICLES MOTOR AND MOTORLESS | CLEVELAND BROTHERS EQUIPMENT CO | procurement | CLEVELAND BROTHERS EQUIPMENT CO Bid # 00011420 | CHEK | Thursday | 08 | September | 2016 | 135744.00 | 1.0 | 12.241909 |
| 239 | CHEK17073038 | RECREATION | REAL PROPERTY | PROF SRVSWIMMING POOLS | BUELL KRATZER POWELL | professional services | Architectural Requirement | CHEK | Thursday | 15 | December | 2016 | 35543.48 | 1.0 | 12.149884 |
| 407 | ACHD17173297 | OFFICE OF BH/MR SERVICES | MATERIALS AND SUPPLIES | PRECISION PHOTOGRAPHIC AND ARTISTS | XEROX CORPORATION | payment voucher | None | ACHD | Friday | 26 | May | 2017 | 70.00 | 1.0 | 12.106859 |
| 537 | CHEK17063829 | OFFICE OF SUPPORTIVE HOUSING | REAL PROPERTY | PROF SRVEMER FIRE SAFETY EQUIP | BURRIS ENGINEERS, INC. | professional services | Engineering Requirements | CHEK | Friday | 25 | November | 2016 | 804.38 | 1.0 | 11.962860 |
| 913 | JEXX17000388 | CITY CONTROLLER | PURCHASE OF SERVICES | SEMINAR AND TRAINING SESSIONS | AMERICAN CONTRACT COMPLIANCE ASSOCIATION | None | None | JEXX | Tuesday | 23 | August | 2016 | 650.00 | 1.0 | 11.815714 |
report = dp.Report(
dp.Text('## Top Ten Potential Outlier Transactions'),
dp.Text('### Ranked by Reconstruction Loss'),
dp.Table(top_ten),
)
report.upload(name='potential_outliers', visibility=dp.Visibility.DEFAULT)
report = dp.Report(dp.Plot(model4.plot_projection(data, which='test_set')))
report.save('vae_batch_norm_gamma_1_proj.html', open=True)
report.upload(name='vae_batch_norm_gamma_1_proj',
visibility=dp.Visibility.DEFAULT,
formatting=dp.ReportFormatting(
width=dp.ReportWidth.NARROW)
)
Report saved to ./vae_batch_norm_gamma_1_proj.html
Uploading report and associated data - please wait...
Let's return to our original motivation: to learn a new way of seeing financial transactions in order to identify a representative sample, or even to focus sampling on areas of higher risk. We saw that modeling transactions with a traditional autoencoder, as well as a variational autoencoder, can facilitate this by learning a lower-dimensional representation of the data that disentangles features we are interested in, or are even unaware of. Through experimentation, we identified the various parameters we can tune and how to improve model performance. We also ranked transactions by reconstruction loss to identify potential outliers.
Challenges and opportunities going forward include implementing this on a larger data set, which would almost certainly need to happen in a distributed way. We could use Spark or Hadoop for this. It would also be useful to deploy this kind of pipeline to a secure web application where businesses could upload their data for automatic embedding and dashboards.